Introduction

row {data-height=650}

The Project

Overview

Medical insurance is an important issue in the United States, and many factors go into insurance companies’ coverage decisions. This project asks how well medical expenses can be predicted from the demographic information that individuals provide. Additionally, because tobacco smoke is a known carcinogen, it is valuable to examine how smoking relates to the other variables in this data set.

Questions

  1. How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical expense value predict whether the individual is a smoker?
  2. How do medical expenses differ by region, children, age, and bmi?
  3. How could the demographics of a smoker be described?

Data Source

Shiva Dumnawar. 2021. Health Insurance, Version 1. Retrieved 5/24/2024 from https://www.kaggle.com/datasets/shivadumnawar/health-insurance-dataset

The Data

Description of the Variables in the Dataset
  • age: This number represents the individual’s age.
  • sex: The individual’s sex assigned at birth, either “male” or “female”.
  • bmi: The value represents the individual’s body mass index number.
  • children: This value is the number of children the individual has.
  • smoker: Can either be “yes” if the individual is a smoker or “no” if the individual is not a smoker.
  • region: The U.S. region the individual is from (southeast, southwest, northeast, or northwest).
  • charges: This number is the value of the individual’s medical expenses.

Variables to Predict & Data Changes
  • charges: Continuous
  • smoker: Classification
  • validation: A 60:40 validation column was created to help in prediction modeling.

Exploration of Data

Summary Statistics

      age            sex                 bmi           children    
 Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
 1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
 Median :39.00   Mode  :character   Median :30.40   Median :1.000  
 Mean   :39.21                      Mean   :30.66   Mean   :1.095  
 3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
 Max.   :64.00                      Max.   :53.13   Max.   :5.000  
    smoker             region             charges     
 Length:1338        Length:1338        Min.   : 1122  
 Class :character   Class :character   1st Qu.: 4740  
 Mode  :character   Mode  :character   Median : 9382  
                                       Mean   :13270  
                                       3rd Qu.:16640  
                                       Max.   :63770  

These summary statistics show that age, bmi, children, and charges each span a wide range for a data set of 1,338 rows (for example, charges run from about 1,122 to 63,770). The region, sex, and smoker variables are categorical.

Categorical Variable Distributions

The region, sex, and smoker variables are categorical and therefore appear as “character type” in the summary statistics; converting them to factors reveals more about them, such as the count in each level.

Sex (female or male)

# A tibble: 2 × 2
  sex        n
  <chr>  <int>
1 female   662
2 male     676

Smoker (yes or no)

# A tibble: 2 × 2
  smoker     n
  <chr>  <int>
1 no      1064
2 yes      274

Region (southeast, southwest, northeast, or northwest)

# A tibble: 4 × 2
  region        n
  <chr>     <int>
1 northeast   324
2 northwest   325
3 southeast   364
4 southwest   325

row

Frequencies

Question #1

Question:

How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical expense value predict whether the individual is a smoker?

Answer: While the total charges for smokers and non-smokers do not differ much, the average charge per individual differs considerably, with smokers averaging much higher charges than non-smokers. This reflects the unequal group sizes in this data set (274 smokers versus 1,064 non-smokers): smokers generally incurred higher charges than non-smokers. A higher charge may therefore indicate that the individual is a smoker, which a logistic regression can test.

row

Total Charges by Smoker Status

row

Average Charges by Smoker Status

Question #2

Question:

How do medical expenses differ by region, children, age, and bmi?

Answer: By region, total and average charges follow a similar pattern. The southeast has the highest total and average charges; the northeast is also slightly higher than the remaining two regions, but only the southeast appears to differ by a meaningful amount. By number of children, total charges decrease as the number of children increases, whereas average charges rise up to 3 children and then fall, staying broadly similar overall. By age, both total and average charges increase, as would be expected with aging. By bmi, charges appear to increase up to a bmi of about 30 and then stay relatively steady.

Column

Total & Average Charges by Region

Column

Total & Average Charges by Children

Column

Total & Average Charges by Age

Column

Total & Average Charges by BMI

Question #3

Question:

How could the demographics of a smoker be described?

Answer: Looking at smoker status by region, age, and sex, the southeast is the only region with a noticeably higher number of smokers: 91 people, versus 58 to 67 in each of the other regions. Smoking has a slight inverse relationship with age, with rates decreasing as age increases. Smoking does not differ much by sex, although more men than women smoke.

Column

Smoking by Region

Column

Smoking by Age

Column

Smoking by Sex

Continuous

Column

Regression Output

                   Estimate  Std. Error  t value  Pr(>|t|)
smokeryes         23848.535     413.153   57.723     0.000
age                 256.856      11.899   21.587     0.000
(Intercept)      -11938.539     987.819  -12.086     0.000
bmi                 339.193      28.599   11.860     0.000
children            475.501     137.804    3.451     0.001
regionsoutheast   -1035.022     478.692   -2.162     0.031
regionsouthwest    -960.051     477.933   -2.009     0.045
regionnorthwest    -352.964     476.276   -0.741     0.459
sexmale            -131.314     332.945   -0.394     0.693

Column

Residual Assumptions Explorations

Row

Adjusted R-Squared

75 %

RMSE

6062.1

Column

Analysis Summary

Comparing this model to the JMP multiple regression, this one appears to account for more variance. That said, the neural model outperforms both regression models: on both the training and validation sets, it has a higher R-squared value and a lower RASE value. This suggests that it accounts for more variance with smaller prediction errors, which is ideal for a model. However, the regression shows which variables are significant in predicting charges. According to the R regression model, every term is significant at p < 0.05 except the northwest region indicator (p = 0.459) and sex (p = 0.693); the southwest indicator is only marginally significant (p = 0.045). This aligns well with the graphs in Question #2.

Classification

Row

Confusion Matrix & Errors

Row

Nominal Logistic

Decision Tree

Boosted Tree

Row

Analysis Summary

Comparing the models by their confusion matrices and R-squared values, the boosted tree is best. All of the models have fairly similar accuracy and sensitivity; however, the boosted tree has the highest validation R-squared, accounting for 76.59% of variability. All predictors except region are significant in predicting whether an individual is a smoker. According to the effect summary, charges is the strongest predictor, followed by bmi and then age.

Conclusion

Summary

Modeling the charges and smoker variables identified the significant predictors of each: for charges, all predictors except the northwest region indicator and sex; for smoker, all predictors except region.

---
title: "INFO 3200 Health Insurance Dashboard"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

<style>
.navbar {
  background-color: green;
  border-color:white;
}
.navbar-brand {
color:white!important;
}

</style>

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```

```{r load_data}
df <- read_csv("health_insurance.csv")
```

Introduction {data-orientation=rows}
=======================================================================

row {data-height=650}
-----------------------------------------------------------------------

### The Project

##### Overview

Medical insurance is an important issue in the United States, and many factors go into insurance companies’ coverage decisions. This project asks how well medical expenses can be predicted from the demographic information that individuals provide. Additionally, because tobacco smoke is a known carcinogen, it is valuable to examine how smoking relates to the other variables in this data set.


##### Questions

  1.	How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical 
      expense value predict whether the individual is a smoker?
  2.  How do medical expenses differ by region, children, age, and bmi?
  3.  How could the demographics of a smoker be described?
  

##### Data Source

Shiva Dumnawar. 2021. Health Insurance, Version 1. Retrieved 5/24/2024 from https://www.kaggle.com/datasets/shivadumnawar/health-insurance-dataset


### The Data

##### Description of the Variables in the Dataset

* **age**: This number represents the individual’s age.
* **sex**: The individual’s sex assigned at birth, either “male” or “female”.
* **bmi**: The value represents the individual’s body mass index number.
* **children**: This value is the number of children the individual has.
* **smoker**: Can either be “yes” if the individual is a smoker or “no” if the individual is not a smoker.  
* **region**: The U.S. region the individual is from (southeast, southwest, northeast, or northwest).
* **charges**: This number is the value of the individual’s medical expenses.

##### Variables to Predict & Data Changes

* **charges**: Continuous
* **smoker**: Classification
* **validation**: A 60:40 validation column was created to help in prediction modeling.
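
The dashboard does not show how the validation column was built. As a minimal sketch of one way to create an exact 60:40 split in base R, assuming one row per individual: the toy data frame, the seed, and the labels `Training` and `Validation` are illustrative assumptions, not the values used in the project.

```r
# Hypothetical stand-in for the insurance data: only the row count matters here.
toy <- data.frame(id = 1:1338)

set.seed(3200)  # arbitrary seed so the shuffle is reproducible

n       <- nrow(toy)
n_train <- round(0.6 * n)  # 60% of the rows go to training

# Build the labels in an exact 60:40 proportion, then shuffle them across rows.
labels <- rep(c("Training", "Validation"), times = c(n_train, n - n_train))
toy$validation <- factor(sample(labels))

table(toy$validation)
```

Shuffling a fixed set of labels guarantees the 60:40 proportion, whereas drawing each label independently with `prob = c(0.6, 0.4)` would only approximate it.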


Exploration of Data {data-orientation=rows}
=======================================================================


### Summary Statistics
```{r}
#View data
summary(df)
```
These summary statistics show that age, bmi, children, and charges each span a wide range for a data set of 1,338 rows (for example, charges run from about 1,122 to 63,770). The region, sex, and smoker variables are categorical.


### Categorical Variable Distributions

The region, sex, and smoker variables are categorical and therefore appear as "character type" in the summary statistics; converting them to factors reveals more about them, such as the count in each level.


```{r, cache=TRUE}
df <- mutate(df,sex=as.factor(sex),
             smoker=as.factor(smoker),region=as.factor(region))
```
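
As a small illustration of why the conversion helps, the sketch below compares `summary()` on a character vector and on the same values stored as a factor. The vector is made up for the example and is not taken from the dataset.

```r
# Made-up values, for illustration only.
smoker_chr <- c("no", "yes", "no", "no", "yes")

summary(smoker_chr)               # character: reports only Length, Class, Mode

smoker_fct <- factor(smoker_chr)
summary(smoker_fct)               # factor: reports one count per level
levels(smoker_fct)                # levels are sorted alphabetically by default
```

The alphabetical ordering of levels also determines the reference category that `lm()` uses for each factor.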

#### Sex (female or male)
```{r, cache=TRUE}
as_tibble (select(df,sex) %>%
  table())

```

#### Smoker (yes or no)
```{r, cache=TRUE}
as_tibble (select(df,smoker) %>%
  table())

```

#### Region (southeast, southwest, northeast, or northwest)
```{r, cache=TRUE}
as_tibble (select(df,region) %>%
  table())

```

row {data-width=250}
-----------------------------------------------------------------------
#### Frequencies

![](catdistr.png)


Question #1 {data-orientation=rows}
=======================================================================
### Question:

*How do the medical expenses differ between smokers and non-smokers? Specifically, can the medical expense value predict whether the individual is a smoker?*

**Answer**: While the total charges for smokers and non-smokers do not differ much, the average charge per individual differs considerably, with smokers averaging much higher charges than non-smokers. This reflects the unequal group sizes in this data set (274 smokers versus 1,064 non-smokers): smokers generally incurred higher charges than non-smokers. A higher charge may therefore indicate that the individual is a smoker, which a logistic regression can test.
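
The link between the two charts is simple arithmetic: a group's total is its size multiplied by its mean. The sketch below uses the group sizes from the Smoker frequency table (1,064 non-smokers and 274 smokers). The small data frame is made up purely to show a grouped total and mean; it does not contain actual records.

```r
# Group sizes reported in the Smoker frequency table.
n_no  <- 1064
n_yes <- 274

# If the two group totals were exactly equal, the smoker mean would have to
# exceed the non-smoker mean by the ratio of the group sizes.
size_ratio <- n_no / n_yes
round(size_ratio, 2)

# Grouped totals and means on a tiny made-up data frame (illustration only).
toy <- data.frame(
  smoker  = c("no", "no", "no", "no", "yes"),
  charges = c(1000, 2000, 3000, 4000, 10000)
)
totals <- tapply(toy$charges, toy$smoker, sum)
means  <- tapply(toy$charges, toy$smoker, mean)
cbind(total = totals, mean = means)
```

In the toy data the totals match while the means differ by the size ratio of the groups, which is the same pattern the answer describes.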


row {data-height=500}
-----------------------------------------------------------------------

### Total Charges by Smoker Status

![](Tot$SS.png)


row {data-height=500}
-----------------------------------------------------------------------

### Average Charges by Smoker Status

![](avg$SS.png)


Question #2 {data-orientation=rows}
=======================================================================
### Question:

*How do medical expenses differ by region, children, age, and bmi?*

**Answer**: By region, total and average charges follow a similar pattern. The southeast has the highest total and average charges; the northeast is also slightly higher than the remaining two regions, but only the southeast appears to differ by a meaningful amount. By number of children, total charges decrease as the number of children increases, whereas average charges rise up to 3 children and then fall, staying broadly similar overall. By age, both total and average charges increase, as would be expected with aging. By bmi, charges appear to increase up to a bmi of about 30 and then stay relatively steady.
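
The charts on this page are supplied as images, so the grouping behind them is not shown. As a hedged sketch of how a continuous variable such as bmi can be binned before totals and averages are computed, the example below uses made-up values and arbitrarily chosen break points; neither comes from the project.

```r
# Made-up values, for illustration only.
toy <- data.frame(
  bmi     = c(18.5, 24.0, 27.5, 31.0, 36.0, 42.0),
  charges = c(3000, 5000, 8000, 12000, 13000, 12500)
)

# Bin bmi into intervals; right = FALSE makes each bin closed on the left, [a, b).
toy$bmi_band <- cut(toy$bmi, breaks = c(15, 25, 30, 35, 55), right = FALSE)

total_by_band <- tapply(toy$charges, toy$bmi_band, sum)
mean_by_band  <- tapply(toy$charges, toy$bmi_band, mean)
cbind(total = total_by_band, mean = mean_by_band)
```

The outer breaks of 15 and 55 are wide enough to cover the bmi range in the summary statistics (15.96 to 53.13).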



Column {data-width=250}
-----------------------------------------------------------------------

### Total & Average Charges by Region

![](regionv$.png)


Column {data-width=250}
-----------------------------------------------------------------------

### Total & Average Charges by Children

![](childrenv$.png)


Column {data-width=250}
-----------------------------------------------------------------------

### Total & Average Charges by Age

![](age.png)


Column {data-width=250}
-----------------------------------------------------------------------

### Total & Average Charges by BMI

![](bmi.png)


Question #3 {data-orientation=rows}
=======================================================================
### Question:

*How could the demographics of a smoker be described?*

**Answer**: Looking at smoker status by region, age, and sex, the southeast is the only region with a noticeably higher number of smokers: 91 people, versus 58 to 67 in each of the other regions. Smoking has a slight inverse relationship with age, with rates decreasing as age increases. Smoking does not differ much by sex, although more men than women smoke.
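
Because the regions differ in size, raw counts of smokers are easier to compare once converted to rates. The sketch below uses only counts reported elsewhere in this dashboard: 91 smokers among 364 people in the southeast, and 274 smokers among 1,338 people overall.

```r
# Counts taken from the frequency tables and the answer text.
southeast_total   <- 364    # people in the southeast
southeast_smokers <- 91     # smokers in the southeast
all_total         <- 1338   # everyone in the dataset
all_smokers       <- 274    # all smokers

southeast_rate <- southeast_smokers / southeast_total
overall_rate   <- all_smokers / all_total

# Smoking rates as percentages, rounded to one decimal place.
round(100 * c(southeast = southeast_rate, overall = overall_rate), 1)
```

On these figures the southeast rate is about 25%, compared with roughly 20.5% for the dataset as a whole, so the southeast stands out in rate as well as in raw count.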


Column {data-width=250}
----------------------------------------------------------------------- 

### Smoking by Region

![](smokingvregion.png)


Column {data-width=250}
-----------------------------------------------------------------------

### Smoking by Age

![](smokingage.png)


Column {data-width=250}
-----------------------------------------------------------------------

### Smoking by Sex

![](smokingvsex.png)


Continuous {data-orientation=rows}
=======================================================================

```{r,include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
charges_lm <- lm(charges ~ .,data = df)
summary(charges_lm)
```

```{r,include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(charges_lm)
```

Column
-----------------------------------------------------------------------
### Regression Output

```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(charges_lm)$coef, digits = 3) #pretty table output
summary(charges_lm)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(charges_lm))[,4])  
out <- coef(summary(charges_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```
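
To show how the estimates in this table combine into a prediction, the sketch below applies the linear model equation by hand, using coefficients copied from the rendered table. The individual described is hypothetical. Female and northeast are the reference levels, because the table has no `sexfemale` or `regionnortheast` term, so they contribute nothing to the sum.

```r
# Coefficients copied from the rendered Regression Output table.
b <- c(intercept = -11938.539, age = 256.856, bmi = 339.193,
       children = 475.501, smokeryes = 23848.535)

# Hypothetical individual (illustration only): 40 years old, bmi of 30,
# 2 children, female, living in the northeast (both reference levels).
pred_nonsmoker <- b[["intercept"]] + b[["age"]] * 40 + b[["bmi"]] * 30 +
  b[["children"]] * 2

# Being a smoker shifts the prediction by the smokeryes coefficient.
pred_smoker <- pred_nonsmoker + b[["smokeryes"]]

round(c(non_smoker = pred_nonsmoker, smoker = pred_smoker), 2)
```

With the fitted model available, `predict(charges_lm, newdata = ...)` gives the same kind of result without copying coefficients by hand.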

Column
-----------------------------------------------------------------------

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(charges_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------
### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(charges_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'))
```

### RMSE

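```{r, cache=TRUE}
# summary()$sigma is the residual standard error, sqrt(RSS / residual df).
# With 1,338 rows and 9 coefficients it is very close to the RMSE, sqrt(RSS / n).
Sig<-round(summary(charges_lm)$sigma,2)
valueBox(Sig)
```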

Column
-----------------------------------------------------------------------
![](contmc.png)

### Analysis Summary
Comparing this model to the JMP multiple regression, this one appears to account for more variance. That said, the neural model outperforms both regression models: on both the training and validation sets, it has a higher R-squared value and a lower RASE value. This suggests that it accounts for more variance with smaller prediction errors, which is ideal for a model. However, the regression shows which variables are significant in predicting charges. According to the R regression model, every term is significant at p < 0.05 except the northwest region indicator (p = 0.459) and sex (p = 0.693); the southwest indicator is only marginally significant (p = 0.045). This aligns well with the graphs in Question #2.
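
The p < 0.05 screening can be checked directly against the regression output. The sketch below copies the terms and rounded p-values from the rendered table into a small data frame and lists the terms that do not meet the threshold. The names `coefs` and `alpha` are introduced here for illustration.

```r
# Terms and rounded p-values copied from the rendered Regression Output table.
coefs <- data.frame(
  term    = c("smokeryes", "age", "(Intercept)", "bmi", "children",
              "regionsoutheast", "regionsouthwest", "regionnorthwest", "sexmale"),
  p_value = c(0.000, 0.000, 0.000, 0.000, 0.001, 0.031, 0.045, 0.459, 0.693)
)

alpha <- 0.05  # significance threshold used in the analysis

# Terms whose p-value is not below the threshold.
not_significant <- coefs$term[coefs$p_value >= alpha]
not_significant
```

Note that `regionsouthwest` sits just under the threshold at 0.045, so its classification is sensitive to the choice of cutoff.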


Classification {data-orientation=rows}
=======================================================================

Row 
-------------------------------------
    
### Confusion Matrix & Errors
    
![](scmerror.png)
   
Row {.tabset .tabset-fade}
-------------------------------------
   
### Nominal Logistic

![](snlogistic.png)
 
### Decision Tree
    
![](spart.png)

### Boosted Tree

![](boosted.png)

Row 
-------------------------------------
### Analysis Summary
Comparing the models by their confusion matrices and R-squared values, the boosted tree is best. All of the models have fairly similar accuracy and sensitivity; however, the boosted tree has the highest validation R-squared, accounting for 76.59% of variability. All predictors except region are significant in predicting whether an individual is a smoker. According to the effect summary, charges is the strongest predictor, followed by bmi and then age.
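
For readers unfamiliar with the metrics compared here, the sketch below derives accuracy, sensitivity, and specificity from a confusion matrix in base R. The actual and predicted labels are made up for illustration and are unrelated to the models shown above; "yes" (smoker) is treated as the positive class.

```r
# Made-up actual and predicted labels, for illustration only.
actual    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "yes", "no", "no", "no", "no", "yes", "no"),
                    levels = c("no", "yes"))

# Rows are predictions, columns are the true labels.
cm <- table(Predicted = predicted, Actual = actual)
cm

tp <- cm["yes", "yes"]  # smokers correctly predicted as smokers
tn <- cm["no", "no"]    # non-smokers correctly predicted as non-smokers
fp <- cm["yes", "no"]   # non-smokers predicted as smokers
fn <- cm["no", "yes"]   # smokers predicted as non-smokers

accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual smokers that are detected
specificity <- tn / (tn + fp)        # share of actual non-smokers that are detected
```

With only about one in five individuals being a smoker, accuracy alone can look high even for a weak model, which is one reason to read it alongside sensitivity.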

Conclusion
=======================================================================
### Summary

Modeling the charges and smoker variables identified the significant predictors of each: for charges, all predictors except the northwest region indicator and sex; for smoker, all predictors except region.